Dados e R
Data Wrangling & DataViz

Encontro 2 | 19/08/20024
Henrique Costa | Métodos Estratégicos em FinQuant

Wrangle III

Summarize

  • Although we store data about many observations…
  • …we often want to summarize across observations
    • This is like folding the tibble down to one row
  • We’ve seen functions that summarize vectors
    • length(), sum(), min(), max()
    • mean(), median(), sd(), var()
  • summarize() lets us use them on tibbles
    • It works very similarly to mutate()
    • It always creates a tibble as output

Summarize Live Coding

# SETUP: We will need tidyverse and an example dataset

library(tidyverse)

sales <- 
  tibble(
    customer = c(1, 2, 3, 1, 3),
    store = c("A", "A", "A", "B", "B"),
    items = c(25, 20, 16, 10, 5),
    spent = c(685, 590, 392, 185, 123)
  ) |> 
  print()

# ==============================================================================

# USECASE: Summarize the typical sales

my_summary <- 
  sales |> 
  summarize(
    avg_items = mean(items),
    avg_spent = mean(spent)
  ) |> 
  print()

# ==============================================================================

# PITFALL: Don't use summary() instead of summarize()

my_summary <- 
  sales |> 
  summary(
    avg_items = mean(items),
    avg_spent = mean(spent)
  ) |> 
  print() # not a tibble

# ==============================================================================

# USECASE: Use more than one summary function

my_summary <- 
  sales |> 
  summarize(
    total_items = sum(items),
    total_spent = sum(spent),
    avg_items = mean(items),
    avg_spent = mean(spent)
  ) |> 
  print()

# ==============================================================================

# USECASE: Use counting functions

my_counts <- 
  sales |> 
  summarize(
    n_sales = n(),
    n_customers = n_distinct(customer),
    n_stores = n_distinct(store)
  ) |> 
  print()

Group Summarize

  • We can also summarize a tibble by group
    • This is like folding the tibble multiple times
    • Specifically, we fold down to one row per group
  • The syntax for summarize is identical
    • The only difference is to the tibble
    • We first pass it through group_by()
    • Pipelines make this very easy

Group Summarize Live Coding

# SETUP: We will need tidyverse and an example dataset

library(tidyverse)

sales <- 
  tibble(
    customer = c(1, 2, 3, 1, 3),
    store = c("A", "A", "A", "B", "B"),
    items = c(25, 20, 16, 10, 5),
    spent = c(685, 590, 392, 185, 123)
  ) |> 
  print()

# ==============================================================================

# LESSON: We pass a tibble through group_by to group it

sales

sales |> group_by(store) # note the display says "grouped"

# ==============================================================================

# USECASE: We can then summarize and get stats per group

sales |> 
  group_by(store) |> 
  summarize(
    customers = n_distinct(customer),
    items_sold = sum(items),
    total_sales = sum(spent),
    avg_items = mean(items),
    avg_spent = mean(spent)
  )

# ==============================================================================

# SETUP: Let's get a larger, more realistic dataset

# Extra pane > Packages tab > Install > nycflights13

library("nycflights13")

flights

# ==============================================================================

# USECASE: Find the carrier with the lowest average delays

flights |> 
  group_by(carrier) |> 
  summarize(m_delay = mean(dep_delay, na.rm = TRUE)) |> 
  arrange(m_delay)

# ==============================================================================

# LESSON: We can also group by multiple variables

# USECASE: Let's find the day of the year with the most flights

flights |> 
  group_by(month, day) |> 
  summarize(n_flights = n()) |> 
  arrange(desc(n_flights))

Visualize I

What is a graphic?

A data visualization expresses data through visual aesthetics.

Describing Graphics

Some simple graphics are easy to describe and may even have ready names.

Describing Graphics

Warning: Using `size` aesthetic for lines was deprecated in ggplot2 3.4.0.
ℹ Please use `linewidth` instead.

A grammar of graphics will help us describe more complex graphics.

The Grammar of Graphics

  • The grammar of graphics is a set of rules for describing and creating data visualizations
  • To make our data visual (and therefore put our highly evolved occipital lobes to work)…
    • We connect variables to visual qualities
    • We represent observations as visual objects
  • This requires four fundamental elements
    • We will first learn about them in lecture
    • We will then apply them in R using {ggplot2}

Data

mpg
# A tibble: 234 × 11
   manufacturer model      displ  year   cyl trans drv     cty   hwy fl    class
   <chr>        <chr>      <dbl> <int> <int> <chr> <chr> <int> <int> <chr> <chr>
 1 audi         a4           1.8  1999     4 auto… f        18    29 p     comp…
 2 audi         a4           1.8  1999     4 manu… f        21    29 p     comp…
 3 audi         a4           2    2008     4 manu… f        20    31 p     comp…
 4 audi         a4           2    2008     4 auto… f        21    30 p     comp…
 5 audi         a4           2.8  1999     6 auto… f        16    26 p     comp…
 6 audi         a4           2.8  1999     6 manu… f        18    26 p     comp…
 7 audi         a4           3.1  2008     6 auto… f        18    27 p     comp…
 8 audi         a4 quattro   1.8  1999     4 manu… 4        18    26 p     comp…
 9 audi         a4 quattro   1.8  1999     4 auto… 4        16    25 p     comp…
10 audi         a4 quattro   2    2008     4 manu… 4        20    28 p     comp…
# ℹ 224 more rows

Graphics require data (e.g., tibbles), which describe observations using variables.

Aesthetic Mappings

Graphics require aesthetic mappings, which connect data variables to visual qualities.

Scales

Graphics require scales, which connect specific data values to specific aesthetic values.

Geometric Objects

Graphics require geometric objects (geoms), which represent the observations.

ggplot2 Basics

  • The ggplot2 package is a part of tidyverse
    • No need to install or load it separately
    • It plays nicely with tibbles and wrangling
  • It implements the grammar of graphics in R
    • The “gg” stands for “grammar of graphics”
    • Thus, we will need to provide all four elements
  • We will create a pseudo-pipeline of commands
    • However, we will use + rather than |>
    • This is because {ggplot2} predates the R pipe

ggplot2 Live Coding

# SETUP: We will need tidyverse and an example dataset

library(tidyverse)

mpg

# ==============================================================================

# LESSON: First, set the data to a tibble
p <- ggplot(data = mpg)
p

# ==============================================================================

# LESSON: Next, set the aesthetic mappings with aes()

p <- ggplot(data = mpg, mapping = aes(x = displ, y = hwy))
p

# ==============================================================================

# TIP: You can leave off the optional argument names

p <- ggplot(mpg, aes(x = displ, y = hwy))
p

# ==============================================================================

# LESSON: Next, set the positional scales

p <- ggplot(mpg, aes(x = displ, y = hwy)) +
  scale_x_continuous(
    name = "Engine Size (in liters)", 
    limits = c(1, 7), 
    breaks = 1:7
  ) +
  scale_y_continuous(
    name = "Highway Fuel Efficiency (in miles/gallon)",
    limits = c(10, 50),
    breaks = c(10, 20, 30, 40, 50)
  )
p

# ==============================================================================

# LESSON: Finally, add a point geom

p <- 
  ggplot(mpg, aes(x = displ, y = hwy)) + 
  scale_x_continuous(
    name = "Engine Size (in liters)", 
    limits = c(1, 7), 
    breaks = 1:7
  ) +
  scale_y_continuous(
    name = "Highway Fuel Efficiency (in miles/gallon)",
    limits = c(10, 50),
    breaks = c(10, 20, 30, 40, 50)
  ) +
  geom_point()

# ==============================================================================

# TIP: If you leave off the scales, R will try to guess

p <- ggplot(mpg, aes(x = displ, y = hwy)) + geom_point()
p

# ==============================================================================

# LESSON: We can also customize the geom with arguments

p <- ggplot(mpg, aes(x = displ, y = hwy)) + 
  geom_point(color = "red", shape = "square", size = 2)
p

Basic Layering

  • ggplot2 uses a layered grammar of graphics
    • We can keep stacking geoms on top
  • Layering adds a lot of possibilities
    • We can convey more complex ideas
    • We can learn more about our data
  • But we can still describe these graphics
    • Just describe each layer in turn
    • And describe the layers’ ordering

Basic Layering Live Coding

# SETUP: We will need tidyverse and an example dataset

library(tidyverse)

mpg

# ==============================================================================

# USECASE: Add a smooth geom (i.e., line of best fit)

ggplot(mpg, aes(x = displ, y = hwy)) +
  geom_point() +
  geom_smooth()

ggplot(mpg, aes(x = displ, y = hwy)) +
  geom_point() +
  geom_smooth(method = "lm")

# ==============================================================================

# USECASE: Add a line geom (i.e., connecting points)

economics

ggplot(economics, aes(x = date, y = unemploy)) + 
  geom_point()

ggplot(economics, aes(x = date, y = unemploy)) + 
  geom_point() +
  geom_line(color = "orange", size = 1)

ggplot(economics, aes(x = date, y = unemploy)) + 
  geom_line(color = "orange", size = 1) +
  geom_point()

# ==============================================================================

# USECASE: Add reference line geoms

ggplot(economics, aes(x = date, y = unemploy)) + 
  geom_hline(yintercept = 0, color = "orange", size = 1) +
  geom_line(color = "blue", size = 1) +
  geom_point()

ggplot(economics, aes(x = date, y = unemploy)) + 
  geom_vline(xintercept = 7.5, color = "orange", size = 1) +
  geom_line(color = "blue", size = 1) +
  geom_point() 

ggplot(economics, aes(x = date, y = unemploy)) + 
  geom_abline(intercept = 4000, slope = 0.5, color = "orange", size = 1) +
  geom_line(color = "blue", size = 1) +
  geom_point() 

Working with Color

  • Color scales come in two main types:
    • Discrete scales have separate colors
      • Best with factor variables
    • Continuous scales form a gradient
      • Best with numeric variables
  • There are two ways to control color:
    • You can map color to a variable
      • It will take on different values
    • You can set color to a value
      • It will take on one value only

Color Live Coding

# SETUP: We will need tidyverse and an example dataset

library(tidyverse)

mpg

# ==============================================================================

# USECASE: Continuous color scales work well with numeric variables

ggplot(mpg, aes(x = hwy, y = cty, color = displ)) +
  geom_point(size = 4)

ggplot(mpg, aes(x = hwy, y = cty, color = displ)) +
  geom_point(size = 4) +
  scale_color_continuous(type = "viridis")

# ==============================================================================

# USECASE: Use a discrete color scale with categorical variables

ggplot(mpg, aes(x = displ, y = hwy, color = drv)) +
  geom_point()

ggplot(mpg, aes(x = displ, y = hwy, color = drv)) +
  geom_point() +
  scale_color_discrete(
    name = "Drivetrain", 
    breaks = c("4", "f", "r"), 
    labels = c("Four Wheel", "Front Wheel", "Rear Wheel")
  )

# ==============================================================================

# PITFALL: Don't forget to set categorical variables as factors

ggplot(mpg, aes(x = displ, y = hwy, color = cyl)) + 
  geom_point() # R guesses you want a continuous scale

ggplot(mpg, aes(x = displ, y = hwy, color = factor(cyl))) + 
  geom_point() + 
  scale_color_discrete(name = "Cylinders")

# ==============================================================================

# LESSON: Set a geom's color aesthetic to make it always that color

ggplot(mpg, aes(x = displ, y = hwy)) +
  geom_point(color = "red")

# ==============================================================================

# PITFALL: However, do this inside of geom() not aes()

ggplot(mpg, aes(x = displ, y = hwy, color = "blue")) + 
  geom_point() #unintended

# ==============================================================================

# LESSON: If you both set and map color, the setting will win

ggplot(mpg, aes(x = displ, y = hwy, color = drv)) + 
  geom_point(color = "blue") 

Themes

  • Themes control how non-data elements look
    • e.g., how thick to draw the gridlines
    • e.g., where to position the legend
  • Complete themes change many elements at once
    • Some are built into ggplot2
    • Others come in R packages
    • {papaja} provides theme_apa()
  • Individual elements can be customized too

Themes Live Coding

# SETUP: We will need tidyverse and an example graphic

library(tidyverse)

p <- 
  ggplot(mpg, aes(x = displ, y = hwy, color = drv)) + 
  geom_point() +
  labs(title = "Fuel Efficiency")
p

# ==============================================================================

# USECASE: Apply a "complete" theme

p + theme_bw()

p + theme_classic()

p + theme_dark()

# ==============================================================================

# LESSON: More more precise control, we can use theme()

p + theme(legend.position = "top")

p + theme(plot.title = element_text(color = "purple", face = "bold"))

p + theme(panel.grid = element_blank())

# NOTE: There are a lot of elements to learn, so use a cheatsheet!

Exporting Graphics

  • We may need to export graphics from R
    • e.g., for a paper, poster, or presentation
  • This job is handling fantastically by ggsave()
    • We can create many types of files
    • We can customize the exact size
  • I recommend .png for most daily purposes
    • For publishing, I prefer .pdf or .svg
    • They retain perfect quality at any zoom
    • You can send these files to most publishers

Exporting Live Coding

# SETUP: We will need tidyverse and an example graphic

library(tidyverse)

p <- ggplot(mpg, aes(x = displ, y = hwy)) + 
  geom_point() + geom_smooth() +
  labs(x = "Engine Displacement", y = "Highway MPG")
p

# ==============================================================================

# USECASE: Save a specific ggplot object to a file

ggsave(filename = "pfinal.png", plot = p)

# ==============================================================================

# LESSON: Specify the size of the file to create

ggsave(filename = "pfinal2.png", plot = p, 
       width = 6, height = 3, units = "in")

# ==============================================================================

# LESSON: Just change the extension to create a different file type

ggsave(filename = "pfinal2.pdf", plot = p, 
       width = 6, height = 3, units = "in")

# ==============================================================================

# PITFALL: Creating a very large file may lead to small text

ggsave(filename = "p_poster.png", plot = p, 
       width = 12, height = 8, units = "in")

# ==============================================================================

# TIP: You can quickly increase the text size using base_size

p2 <- p + theme_grey(base_size = 24)

ggsave(filename = "p_poster2.png", plot = p2,
       width = 12, height = 8, units = "in")

Practice IV